This dataset is a collection of aligned versions of professionally produced studio speech recordings and recordings of the same speech on common consumer devices (tablet and smartphone) in real-world environments. It consists of 20 speakers (10 female and 10 male), each reading 5 excerpts from public-domain books, which provides about 14 minutes of data per speaker.
The initial recordings were made in a professional recording studio. Multiple versions were created from these initial recordings. In the first version, a professional sound engineer applied audio effects to create production-quality speech. In the other versions, the initial recordings were played through a high-quality coaxial loudspeaker in real-world environments and recorded onto consumer devices. This was done with 2 devices in a number of real-world acoustic environments, yielding a total of 12 versions of the device recordings. A more detailed description of the creation of the dataset is in this paper:
To download the entire dataset, click on "20 original" in the download options on the right. To download individual versions, click on ZIP. A brief description of the dataset is below. Additionally, the following two .zip files are provided:
- sample - A set of files to demonstrate the dataset. It consists of all studio and device recordings of one script of one speaker. It also contains an Adobe Audition session file, which puts all versions of the sample into a multitrack editor.
- supplementary_files.zip - Scripts read by the speakers as well as a set of MATLAB files to assist in creating new device recordings.
Description of Data:
Each version of the entire set of recordings (100 WAV files, one per speaker and script) is in a separate .zip file. The file naming convention for the studio recordings is as follows:
<speaker>_<script>_<version>.wav
For example, f2_script4_produced.wav is the professionally produced version of the second female speaker reading the fourth script.
The versions of the studio recordings are as follows:
- cleanraw - Original clean studio recording, which includes speech as well as non-speech sounds such as breaths, mouth sounds, and the sound of clothes.
- clean - A version of cleanraw with most of the non-speech sounds removed.
- produced - A version of clean with aesthetic effects and processing applied. This is the final studio version.
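The studio file names can be split into their parts programmatically. The following is a minimal Python sketch, not part of the dataset's own tooling. The regular expression is an assumption inferred from the example f2_script4_produced.wav and the three version names listed above, and the function name parse_studio_name is hypothetical.

```python
import re

# Assumed pattern, inferred from the example f2_script4_produced.wav:
# gender letter (f or m), speaker number, script number, version name.
STUDIO_NAME = re.compile(
    r"^(?P<gender>[fm])(?P<speaker>\d+)"
    r"_script(?P<script>\d+)"
    r"_(?P<version>cleanraw|clean|produced)\.wav$"
)


def parse_studio_name(filename):
    """Return the fields of a studio file name, or None if it does not match."""
    match = STUDIO_NAME.match(filename)
    if match is None:
        return None
    return {
        "gender": "female" if match["gender"] == "f" else "male",
        "speaker": int(match["speaker"]),
        "script": int(match["script"]),
        "version": match["version"],
    }


if __name__ == "__main__":
    print(parse_studio_name("f2_script4_produced.wav"))
```

Returning None for non-matching names makes it easy to skip unrelated files when walking an extracted .zip directory.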
The file naming convention for the device recordings (obtained by playing "clean" through a high-quality loudspeaker in a real-world environment and recording it onto a device) is as follows:
<speaker>_<script>_<device>_<room>.wav
For example, m5_script1_ipad_office1.wav is the iPad recording in the first office of the fifth male speaker reading the first script.
The devices used are as follows:
- ipad - An iPad Air was placed on a stand to simulate a person holding it. This recording was done in all rooms.
- ipadflat - An iPad Air was placed flat on a table. This recording was done in two rooms.
- iphone - An iPhone 5S was placed on a stand to simulate a person holding it. This recording was done in three rooms.
The rooms are as follows:
- office1 - more reverberant office
- office2 - less reverberant office
- confroom1 - smaller conference room
- confroom2 - larger conference room
- livingroom1 - relatively reverberant living room with occasional traffic noise from outside
- bedroom1 - bedroom with occasional traffic noise from outside
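Because each device recording was made by playing the "clean" studio version, a device file can be paired with its studio source by name alone. The sketch below illustrates this in Python under stated assumptions: the pattern is inferred from the example m5_script1_ipad_office1.wav, the device identifiers are copied from the list above, and the function names parse_device_name and clean_counterpart are hypothetical. Room names are returned as found rather than checked against the room list.

```python
import re

# Device identifiers copied from the description above.
DEVICES = {"ipad", "ipadflat", "iphone"}

# Assumed pattern, inferred from the example m5_script1_ipad_office1.wav.
DEVICE_NAME = re.compile(
    r"^(?P<speaker>[fm]\d+)"
    r"_script(?P<script>\d+)"
    r"_(?P<device>[a-z]+)"
    r"_(?P<room>[a-z]+\d+)\.wav$"
)


def parse_device_name(filename):
    """Return the fields of a device file name, or None if it does not match."""
    match = DEVICE_NAME.match(filename)
    if match is None or match["device"] not in DEVICES:
        return None
    return {
        "speaker": match["speaker"],
        "script": int(match["script"]),
        "device": match["device"],
        "room": match["room"],
    }


def clean_counterpart(filename):
    """Return the name of the 'clean' studio file that was played back."""
    fields = parse_device_name(filename)
    if fields is None:
        raise ValueError(f"not a device recording name: {filename}")
    return f"{fields['speaker']}_script{fields['script']}_clean.wav"


if __name__ == "__main__":
    name = "m5_script1_ipad_office1.wav"
    print(parse_device_name(name))
    print(clean_counterpart(name))
```

Pairing by name like this gives (device, clean) file pairs; substituting "produced" for "clean" in the returned name would give (device, produced) pairs instead.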